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Abstract 

Mean profiles are widely used as indicators of the electricity consumption habits 
of customers. Currently, in Electricite De Prance (EDF), class load profiles are esti- 
mated using point-wise mean function. Unfortunately, it is well known that the mean is 
highly sensitive to the presence of outliers, such as one or more consumers with unusu- 
ally high-levels of consumption. In this paper, we propose an alternative to the mean 
profile: the Li-median profile which is more robust. When dealing with large datasets 
of functional data (load curves for example), survey sampling approaches are useful 
for estimating the median profile avoiding storing the whole data. We propose here 
estimators of the median trajectory using several sampling strategies and estimators. 
A comparison between them is illustrated by means of a test population. We develop 
a stratification based on the linearized variable which substantially improves the ac- 
curacy of the estimator compared to simple random sampling without replacement. 
We suggest also an improved estimator that takes into account auxiliary information. 
Some potential areas for future research are also highlighted. 

Key Words: Horvitz-Thompson estimator, k-means algorithm, poststratification, stratified 
sampling, substitution estimator, variance estimation. 
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1 Introduction 



The French electricity company, Electricite De France (EDF), uses extensively the customer 
class load profiles in distribution network calculation. Load profiles are also used to predict 
future loads in distribution network planning or to estimate the daily load curve of a new 
customer. The customer data usually includes information about the type of the electricity 
connection, the customer class, the consumption type and some other additional information. 
The combination of the individual customer informations and the class load profiles allows 
us to estimate its load curve. 

At EDF, the mean profile is used as an indicator of the electricity consumption of the 
customers. Nevertheless, it is well known that the mean is highly sensitive to the presence of 
outliers, for instance consumers with high level of consumption. As Small (1990) states, "it 
suffices to have only one customer contaminating a data set and going off to infinity to send 
the mean curve to infinity as well. By contrast, at least 50% of the data must be moved to 
infinity to force the median curve to do the same". More precisely, the median is robust to 
punctually extreme electricity consumptions of some customers and from a practical point 
of view, this can help to manage the electricity supply. Moreover, in the context of the 
electricity open market, new customers join the EDF company while other ones leave it and 
it is important to know the amount of electricity demand. Since the load profiles are not 
known for new customers, it is more difficult to predict their impact on the global electricity 
demand. Based on individual customer information, new customers will belong to a specific 
class and will be allowed the synthetic profile that describes the consumption behavior of its 
class. In these situations, robust profiles should be used and this is why, we suggest using 
the median profile besides the mean curve as a robust indicator for analyzing the population 
of electricity load curves. 

The median of a sample of univariate observations is a natural and useful characteristic 
of central position. Multivariate data, on the other hand, have no natural ordering. There 
are several ways to generalize the univariate median to multivariate data and they all have 
their advantages and disadvantages (see Small, 1990 for a survey of multidimentional me- 
dians). First uses of the multivariate median were limited to two-dimentional vectors and 
were motivated mainly by problems of quantitative geography (namely, of centro-graphical 
analysis) which were dealt with by the U.S.A. Bureau of census in the late 19th and early 
20th century. 

We focus here on the Li-median, also called the geometric or spatial median. Early work 
on the spatial median is due to Hayford (1902) and Gini & Galvani (1929) among others. 
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Its definition is a direct generalization of the real median proposed by Haldane (1948) and 
properties of the spatial median have been studied in details by Kemperman (1987). Iterative 
estimation algorithms have been developed by Gower (1974) and Vardi and Zhang (2000). 

In the next few years, the French company EDF intends to install over 30 millions elec- 
tricity smart meters, in each firm and household. A meter is an electronic device constructed 
for measuring the electricity consumption. These meters will be able to send individual elec- 
tricity consumption measures on very fine time scales. The new smart electricity meters 
will provide accurate and up-to-date electricity consumption data that can be used to model 
distribution network loads. In view of this new setting, the interest variables such as the 
consumption curve for example, may be considered as realizations of functional variables de- 
pending on a continuous time index t that belongs to [0, T] rather than multivariate vectors. 
Kemperman (1987), Cadre (2001) and Gervini (2008) studied the properties of the median 
with functional data. We cite also the very recent work of Cardot et al. (2011). 

Another important issue is data storage. The amount of load data will be enormous when 
all or almost all of the customers have smart meters. Collecting, saving and analyzing all this 
information, would be very expensive. For example, if measures are taken every 10 minutes 
during one year and if we are interested in estimating the total electricity consumption for 
residential customers, the data storage is of about 100 terabytes. 

We suggest using survey sampling techniques in order to get estimates of the median 
profile that are as accurate as possible at reasonable cost. The reader is referred to Fuller 
(2009) for a very recent monograph on survey sampling theory. Nevertheless, the idea of 
selecting randomly a sample from a population of curves is rather new. Chiky & Hebrail 
(2008) compare two approaches for estimating the mean curve. The first one consists in using 
signal compression methods for the whole population of curves and the second approach 
suggests taking a simple random sampling of the actual curves. Their conclusion is that 
the results are better in the latter situation even with rather simple sampling designs. Very 
recently, Cardot et al. (2010) developed the estimation of functional principal component 
analysis with survey data and Cardot and Josserand (2011) studied the properties of the 
mean curve estimator with stratified sampling. We cite also Chaouch and Goga (2010) who 
treated the estimation of geometric quantiles, the generalization of the spatial median, with 
survey data. As far as we know nothing has been done in the estimation of the Li-median 
in a functional framework with survey data whereas it might have great practical interest. 
This is why, we investigate in this paper the median curve estimator when several sampling 
designs and estimators are used. It is worth mentioning that the results presented in this 
paper may be applied for other functional data which are not necessarily related to time as 
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it was the case of the electricity data. Nowadays, functional data may arise in various other 
domains such as chemometrics or remote sensing and then the functional response variables 
depend on index t that may be a frequency and not necessary a time index. 

The paper is structured as follows: section 2 gives the main results concerning the median 
curve estimation with survey data. A weighted estimator for the median curve is suggested 
and its asymptotic variance function is exhibited by means of the linearization technique 
developed by Deville (1999). A variance estimator is also proposed. Section 3 gives the 
estimation of the median curve and of its variance function for a firm population of N = 18902 
load electricity curves. We consider several sampling designs: the simple random sampling 
without replacement, the systematic sampling, the stratified sampling with optimal and 
proportional allocation, and the with replacement proportional-to-size design. In the case of 
the stratified sampling, we suggest using the k-means algorithm to construct homogeneous 
strata with respect to the linearized variables. We illustrate through simulations that a 
substantial reduction compared to simple random sampling is obtained. We adapt to the 
functional framework the selection of a sample when auxiliary information is used at the 
sampling stage as for the with replacement proportional-to-size design. Finally, we improve 
the Horvitz-Thompson estimator of the functional median by considering the poststratified 
estimator. 



2 Functional Median in a Survey Sampling Setting 

Let us consider the finite population U = {1, . . . , N} of size N and a functional variable 
y defined for each element k of the population U : Yk(t), for t G [0,T], with T < oo. Let 
<•,•>, respectively || • ||, be the inner product, respectively the norm, defined on L 2 [0,T}. 
The empirical median trajectory calculated from Yi, . . . , Yjv is defined as (Chaudhuri, 1996 
and Gervini, 2008) 



N 

m N = argminy^ \\Y k - y\\. (1) 
2/eL2[o,T] k=l 

Supposing that Y k , for all k — 1, . . . , N, are not concentrated on a straight line, the median 
exists and is unique (Kemperman, 1987) and is the solution of the following estimating 
equation, 



fc=i 
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provided that ^ Yk for all k = 1 , . . . , N. 

For Yi, . . . , Yjy e M. d , the median defined by the formula Q arises as a natural gener- 
alization of the well-known characterization of the univariate median (Koenker and Basset, 
1978), 

N 

q = arg min | Y*. — 9\ 

k=l 

and it was called the spatial median by Brown (1983) or the Li-median by Small (1990). 
Weber (1909) considered as a solution to a problem in a location theory in which the 
Yi, . . . , Yn are the planar coordinates of N customers, who are served by a company that 
wants to find an optimal location for its warehouse. It is also known as the Fermat- Weber 
point. A geometrical interpretation of the median defined by ^ is that the centroid of the 

y ■ — ■ 

vectors rr is the origin in Mr. With only three points and bidimensional data, the 

H^fc - m N \\ 

median is known to be the Steiner point of the triangle Y1Y2Y3. The spatial median has 
also origins in the early work during the twelve census in the United Sates in 1900 concerned 
by finding the geographical center of the population over time. Hayford (1902) proposed the 
point-wise median as the geographical center but explicitly noted the drawback of the fact 
that the point-wise median depends on the choice of the orthogonal coordinates and it is 
not equivariant under orthogonal transformations. Brown (1983) goes further with this idea 
and states that when dealing with spatial data where variables possess isometry and require 
statistical techniques that have rotational invariance, it is more appropriate to use a median 
that shares these properties. We recall that with observations Yi,...,Y/v that lie in M d , 
the point-wise median is the <i-dimensional vector of medians computed from the univariate 
components and for functional variables, the empirical point-wise median is obtained if the 
L 1 norm is used in ([!]) instead of the L 2 norm, 

N 

med(t) = arg min V \Y k {t) - y(t)\, for all t e [0,T]. 

To illustrate the mean curve and the point-wise median versus the spatial median, we 
plot in Figure [T] the three curves for the test population of iV = 18902 companies considered 
in section |3j The electricity consumption is measured every 30 minutes. 

Chaudhuri (1996) shows that the geometric quantiles defined in formula ^ from below 
and which are a generalization of the median defined by Q are equivariant under orthogonal 
transformations unlike the point-wise median. Moreover, Chaudhuri showed also that the 
spatial or the L\ median is equivariant under any homogeneous scale transformation of 
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Figure 1: The spatial median profile is plotted in dashed line, the point-wise median profile 
in dotted line and the mean profile in solid line. 

the coordinates of the multivariate observations which is appropriate when one needs to 
standardize the coordinate variables appropriately before computing the median. 

The main arguments that play in favor of the spatial median are the uniqueness (see 
e.g. Chaudhuri, 1996 for the <i-dimensional case with d > 2 and Kemperman, 1987 for the 
functional case) and the fact that it is a global and central indicator of the distribution of 
the data. More exactly, the spatial median takes into account all instants making the spatial 
median a central indicator of the distribution of the data while the point-wise median is a 
central indicator but only for each instant. Consider for example that we have consumption 
electricity data recorded during two weeks: one working week and one holiday week such 
as the Christmas week. We compute first the spatial, respectively the point-wise median, 
by considering only the working week time measurements. Next, we consider the two week 
consumption electricity and we compute again the spatial, respectively the point-wise me- 
dian. It can be noticed that the coordinates of the point-wise median that correspond to the 
working week are the same in both situations while they are changed for the spatial median. 
Since is due to the fact that the spatial median is computed by taking into account all the 
time measurements while the point- wise median is computed instant by instant. 

Moreover, Brown (1983) shows that there is an asymptotic efficiency from using the 
spatial median instead of the point-wise median. In fact, one can see that the objective 
function is differentiable in the case of the spatial median while this property is not fulfilled 
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for the point-wise median. 

As noted by Serfling (2002), the median defined by Q and Yi, . . . ,Y N G M. d , has a nice 
robustness property in the sense that m N depends only on its direction towards Y{. More 
exactly, remains unchanged if the Yi are moved outward along these rays while it is 
obvious that the point-wise median will change. 

Remark: Chaudhuri (1996) extends the definition ([I]) to geometric quantiles by using the 
geometry of data clouds. In a functional setting, its definition indexes the quantiles by the 
elements v G L 2 [0,T] with \ \v\ \ < 1, 

TV 

Q(v) = argmin V" (\\Y k - y\ |+ < Y k - y, v >) . (3) 

In this way, functional quantiles are characterized by a direction and magnitude specified by 
v G L 2 [0,T] with H?; 1 1 < 1. Nevertheless, except the case v = 0, it is difficult to interpret 
the functional quantile defined in this way. This is why, our discussion is limited to the case 
v = 0. 

2.1 The design-based estimator for the median tun 

Algorithms have been proposed to solve the equation ^ (Vardi and Zhang, 2000, Gervini, 
2008) but they need important computational efforts especially when the number of time 
measurements is large. In this work, we suggest estimating the median curve mjv by taking 
only a sample s from U according to a sampling design. A probability measure p(-) on 
the set of subsets of U, henceforth denoted V(U), is called a sampling design. Any random 
variable S with values in V(U) and distribution p, is called a random sample associated to 
the sampling design p. Let s be a realization of S. For any k G U, the inclusion probability 
of k is given by 

ir k = F(keS) = Y,P(s)> 

fees 

where the sum is considered over all samples s containing the individual k. If k I are two 
elements of U, the second-order inclusion probability of k and I is given by 

n kl = F(k,leS) = J2p(*)> 

k,l£s 

where the sum is considered over all samples s containing both k and I. 
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In practice, a wide variety of selection schemes are used. We distinguish direct element 
sampling designs such as the simple random sampling without replacement (SRSWOR), 
stratified sampling (STRAT) or proportional-to-size sampling designs (with or without re- 
placement). Most of these designs are used extensively in practice. However, such designs 
require having a sampling frame list identifying every population element which may be 
difficult, expensive or even impossible to realize. In order to avoid it, more complex designs 
such as cluster or multi-stage designs can then be used. This is for example appropriate 
when the population is widely distributed geographically or may occur in natural clusters. 
Using such designs saves money and human efforts but entails a loss of efficiency. A detailed 
presentation of the survey sampling theory and many practical illustrations can be found 
in Korn and Graubard (1999), Lehtonen and Pahkinen (2004) and the reference book of 
Sarndal, Swensson & Wretman (1992). 

The median given by ^ is a nonlinear parameter of finite population totals defined by 
an implicit equation. In order to estimate m^, we use the functional substitution approach 
proposed by Deville (1999) for multivariate variables y and extended to functional variables 
y by Cardot et al. (2010). Let M be the discrete measure defined on L 2 [0, T] assigning the 
unity mass to each curve Yk with k e U and zero elsewhere, namely 

where 5y k is the Dirac function in Y^. The total mass of M is N, the population size. Let T 
be the functional with respect to M and depending of y as follows 

Remark that T defined in this way is the derivative with respect to y of the objective function 
given in Q. The median is then defined as an implicit functional with respect to M, 

T(M;m N )=0 (5) 

or equivalently, 

y - m N 



II v ndM = 0. (6) 

\\y - m N \\ 

Let M be a weighted estimator of M based on the sample s, 

M = ^ W k&Y k = ^ I k W k^Y k , (7) 



S 



where Ik = l{fces} is the sample membership indicator of element k G U (Sarndal et al, 1992). 
In fact, M is also a discrete and finite measure assigning the weight Wk for each with fcGs 
and zero elsewhere. Usually, one take Wf. = l/vr^, the Horvitz-Thompson weights. In this 
case, we obtain the Horvitz-Thompson (1952) of M which estimates unbiasedly the measure 
M since E p (Ik) = iTk, for ah k G U for E p (-) the expectation with respect to the sampling 
design p(-). The reader is referred to Cardot et al. (2010) and Cardot and Josserand (2011) 
for more details about the Horvitz-Thompson estimation with functional data. However, 
for Y\, . . . , Yn lying in M d , weights that take into account auxiliary information have been 
suggested. We mention Deville (1999) for calibration weights or the very recent work of 
Goga and Ruiz-Gazen (2011) for nonparametric weights. Nevertheless, the extension to the 
functional case is not straightforward and it will be treated elsewhere. For the rest of the 



paper, we consider Wk = an d in section 3.1, we suggest the poststratified estimator of 
M. 

Plugging M into the functional expression of given by (5), yields the design-based 
estimator fh n of m^. Hence, m n verifies 

T{M-m n ) = 0, 

namely, fh n is the solution of the design-based estimating equation, 



E 



1 Y k -rhr 



. n k Wk - rh n \\ 
fees 

Supposing now that for all k G s, Y^ ^ m n and that Y}~ are not concentrated on a straight 
line, we obtain that the solution m n exists and is unique following the same arguments as 
in Kemperman (1987) or Chaudhuri (1996). The median estimator rh n is also called the 
substitution estimator of and it is defined by a non-linear implicit function of Horvitz- 
Thompson estimators. As a consequence, the variance as well as the variance estimator of 
fh n can not be obtained directly using Horvitz-Thomson formulas. We will give in the next a 
first-order expansion of fh n in order to approximate m n by the Horvitz-Thompson estimator 
for the finite population total of appropriate artificial variables. 



2.2 Asymptotic properties 

The functional T given by Q is Frechet differentiable (Serfling, 1980) with respect to the 
measure M and y. Let T be the Jacobian operator of T with respect to y and given by 
(Gervini, 2008) 
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where I is the identity operator defined by ly = y and the tensor product of two elements a 
and b of L 2 [0, T] is the rank one operator such that a £g> b(y) =< a,y > b for all y G L 2 [0, T}. 
One can easily obtain that T is a strictly positive operator, namely < Ty, y » and 
supposing that iV -1 J2keu ll-^fc — m v|| _1 < oo, we can get following the same arguments 
as in Cardot et al. (2011), that Y/N is a bounded operator, namely HT/A^Hoo < oo with 
||r||oc = sup|| y || <:L | \Yy\ |. We recall that for the operator Y : L 2 [0,T] — > L 2 [0,T], we have 

ry = J2 uv 1 ii (y- < Iv' mN t > ^- m ^) fora11 y^ L2 M 

\\Y k -m N \\ V \\Yk-m N \\ 2 J 

which gives 



fce(7 



where 



7M) = E 



Yfc - m^vl 

(y fe (r) - m N (r))(Y k (t) -m N (t)) 
\Y k - m N \\ 3 



keu 

The median is defined by the implicit equation ^ and using then the implicit function 
theorem, we obtain that it exists a unique functional T such that 

f (M) = m N 

and 

f (M) = m n . 

Moreover, the functional T is also Frechet differentiable with respect to M and the derivative 
of T with respect to M is called the influence function and defined, when it exists, as follows 

where 5^ is the Dirac function at £ G L 2 [0,T]. Note that this definition suggested by Deville 
(1999) and extended to the functional case by Cardot et al. (2010) is slightly different from 
the usual definition of the influence function used in robust statistics (see e.g. Hampel, 1974 
or Serfling, 1980), which is based on a probability distribution instead of a finite measure M. 



10 



A nonstandardized measure M is used in survey sampling because the total mass N may be 
unknown. 

Under the asymptotic framework from Deville (1999), we may give a first-order von-Mises 
(1947) expansion of f in M/N around M/N, 

~(m\ ~{m\ r ~(m\(m m\ , s , , s 



N J \N J J \N'*J \ N N 
which may be written in the equivalent form, 

f(M) = f(M) + J IT (M, d(M - M) (f) + o^n' 1 / 2 ) (12) 

since T is a functional of degree zero, namely T(M/N) = T(M) and in this case, IT \ = 
N ■ IT (M,£) (Deville, 1999). 

Let Uk, for all k £ U, be the linearized variables of T(M) = and defined as the value of 
the influence function IT at £ = Yk, namely 

u k = lf{M,Yk) = T^( ^- mN \. (13) 

\\\Yk - m N \\J 

We have used here the fact that for fixed y, the functional T(M; y) = — tj— - — ^77 is a 

finite population total with influence function at Y k given by IT(M, Y k ; y) = —(Y k — y)/\\Yk — 
y\\ (Deville, 1999). From the Riesz's theorem, we have that for all bounded h £ L 2 [0,T] 
there is a unique / £ L 2 [0,T] such that Tf = h and Tf(g) =< h,g > for all g £ L 2 [0,T]. 
This unique / will denote T~ l h for a given h £ L 2 [0,T]. 



Hence, the expansion (12) becomes 

Uk 



m n = m N 



+ £--5> + °^- 1/2 )- (14) 



fces K keu 



The above formula shows that the nonlinear estimator m n may be approximated by the 
Horvitz-Thompson estimator for the total of the linearized variables it*.. In this way, Uk is 
an artificial variable used to compute the approximative variance of m n . Now, the linearized 
variable Uk is also a functional defined on L 2 [0,T] and it is unknown since and T are 
unknown. We suggest estimating Uk by 

u k = r- ( ij^ij) , (15) 

V \ \Y k - m n \\J 
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where T is given by 



kes 



n k \\Y k - m n \ 



(Yfc - m n ) ® (Y k - fh n ) 
\\Y h - mjp 



(16) 



Using relation (14), one can obtain the asymptotic variance function of rh n calculated under 
the sampling design, 



var 



(t) = £ J>« - t^tt,) . !^ . ^ = „'(f)Auif ! for all f € 0. 7 ] 



where u(i) = (u k (t)) ke u with u k {t) is given by (13) and A 



variance is estimated by 



K k iti 



k,ieu 



var 



kGs fcSs 



(17) 



The 



18) 



where u s (t) = (u k (t)) kes with iifc given by (15) and A 



Kkl - 7l k 7Tl 



Remark 1: It is worth mentioning that the linearized variable u k plays a central role for 
the estimation of the median. More exactly, the efficiency of any sampling design used for 
estimating the median curve depends on how well it estimates the total of the linearized 
variable u k . For example, a stratified strategy will be efficient if the strata are homogeneous 
with respect to u k as it will be showed below. Nevertheless, to put in practice such a design 
requires knowing all u k which may not be readily available. 



Remark 2: In practice, we observe the curves Y k at D discretized points, say < t\ < £2 < 
. . . < £d < T that we suppose to be the same for all the curves. When the discretization 
points vary to one curve to another, methods described in Ramsay and Silverman (2005) 
may be employed. In order to compute numerical approximations to integrals and inner 
products, quadrature rules are used. 

With discretized points, curves may be viewed as multidimensional vectors, in our case, 
YJ!, = (Y k (ti), . . . , Y k (tr>)) and u' fc = (u k (ti), . . . , u k {t£,)). For each k e s, we need to compute 
the estimated linearized variable in points t\, . . . Let u s = (u' k ) kes be the sample vector 
of estimated linearized variables which can be derived by solving the D x n dimensional 

system 

/ Yi - m n Y N - m n \ 

V Yi-m n \\Y N -m n \\J 
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where T given by ( 16 ) is replaced by a D x D symmetric matrix. The variance estimator is 
then derived directly using (18). 



3 Application to the EDF load curves 
3.1 General settings 

The volume of data treated and analyzed by Electricite De France is increasing greatly. In 
fact, in the next few years Electricite De France plans to install millions of smart electricity 
meters that will be able to send, on request, electricity consumption measurements every 
second. Obviously, it will be difficult to store and analyse online all these information. The 
statistician's challenge is to find a strategy, meaning indicators and estimation methods, 
capable to give a good description of data and to used it for forecasting. While working with 
huge data, methods not being time-consuming are highly desirable. 

Our proposal consist in considering the median curve as a robust indicator of the data 
and estimating it with probability sampling designs. As Lohr stated in "Sampling: Design 
and Analysis" (1999): // a probability sampling design is implemented well, an investigator 
can use a relatively small sample to make inferences about an arbitrarily large population. 

Let U be a population of iV = 18902 electricity meters installed in small and large 
companies sending every 30 minutes the electricity consumption during a period of two weeks. 
We aim at estimating the median curve of the electricity consumption during the second week 
whereas the consumption recorded during the first week will be used as auxiliary information. 
This means that we have 336 time measures. So, our study population of curves is a set of 
N = 18902 vectors = (Y fc (ti), . . . , Y k (t D )) with D = 336. Let X k be the consumption 
curve for the kth firm and recorded during the first week. The consumption curves present 
low peaks corresponding to night time measurements and high peaks corresponding to middle 
day measurements. The electricity consumption decreases roughly around the 250th time 
measurement which corresponds to the beginning of weekend time. The mean and median 
curves present the same effect as we can see in Figure [Tj 

We consider several strategies of fixed size n = 2000 and we compare them through 
simulations. We distinguish two kinds of sampling designs whether they use or do not use 
auxiliary information. If auxiliary information is used at the sampling stage, some changing 
are needed because the variables involved now are curves. On the opposite situation, the 
selection of the sample is realized from the sampling frame list as for classical multivariate 
surveys. Finally, the frame list of French firms is well-constructed being very often updated 
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and most of the designs considered below are usually used in practice. 



1. Simple random sampling without replacement (SRSWOR). 

The SRSWOR sampling is a very simple design easy to put into practice. Every 
possible subset of n units in the population has the same chance to be the sample. In 
a functional framework, the selection of a sample of n = 2000 curves is performed as 
for the multivariate surveys, namely n labels are drawn from the list of iV companies. 
The estimation of the median curve with SRSWOR is obtained from equation ^ for 
7i"fc = n/N, namely m n is the unique solution of the following equation 



fces 



The asymptotic variance function is equal to 

var SRSWO R{t) = N 2 fl-IJ Sl (thU , for all t 

where S^ (t))£/ = J2keu( u k( t ) -^it)) 2 / (N - 1) with u(t) = J2keu u k( t )/ N - Tllis variance 
is estimated by 



var S RSwoR(t) = N 2 Q - S2(t),*» for a11 * 



(20) 



where S 2 , t)s = J2kes(^k{t) -u s {t)) 2 /(n- 1) with u a (£) = Y,kes^k{t)/n and w fc (t) given 



by (15) 



2. Systematic sampling (SYS). 

We consider the systematic design in its basic form (Sarndal et al, 1992). The inclusion 
probabilities are ir^ = n/N, so the median estimator is obtained according to the same 



equation (19). It is well-known that the systematic sampling may be very inefficient 
compared to the SRSWOR sampling if the systematic samples are homogeneous. One 
way to improve the efficiency of SYS sampling is to order the sampling frame list 
according to an auxiliary variable highly correlated with the variable of interest. In this 
way, adjacent elements tend to be more similar than elements that are farther apart. In 
our study, we ordered the frame according to the mean electricity consumption during 
the first week, namely the variable = Yld=i-^k(td)/D, for all k G U and D is the 
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number of discretization points in [0, T\. Another trade-off for the simplicity of SYS 
sampling is that there is no unbiased estimator of the design variance function since 
7Tfcz = for all k and / not belonging to the same systematic sample. However, using 
the ordering according to the variable the SYS is at least as good as the SRSWOR 
sampling. So, we might use the variance estimator appropriate for the SRSWOR design 



given in (20) 



Systematic sample is really a special case of cluster sampling, so it is often used when 
it is difficult to construct a sampling frame in advance. 

3. Stratified sampling with simple random sampling without replacement within strata 
(STRAT). 



In this case, the population is divided into H nonoverlapping strata denoted Uh and a 
simple random sample without replacement is selected independently in each stratum. 
Let rih be the sample size within stratum h and Nh be the stratum size. To obtain the 
median estimator, we solve the estimation equation 

X X " =0, 



h=l s h 



n%\\Y k - m n \ 



where Sh = s PI Uh and tt% = Uh/Nh- It is well-known that stratification may substan- 
tially improve the quality of estimates compared to simple random sampling without 
replacement and systematic sampling if the strata are well-constructed. More exactly, 
the more homogeneous the strata are the more efficient the stratification is. It is worth 
mentioning that improving the estimation of the median curve means constructing 
strata homogeneous with respect to the linearized variables, Uk- Indeed, relation (17) 
gives us that the asymptotic variance function of rh n with STRAT is 



u{t),U h i 

\ ill. . v ; , / 

h=l 

where S^m Uh is the population variance within stratum h of u(t) = (uk(t))keu h - That 
is, the lower the variation of the linearized variable within stratum, the lower the 
asymptotic variance of rh n . The variance estimator is the sum of variance estimators 
(20) computed within each stratum. 

Usually, one builds the stratification using a variable known on the whole population 
and strongly correlated with the variable of interest. In our case, we suggest two 
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stratification variables computed using the first week: the first one is the linearized 
variable Uk and the second one is the consumption Y k (Cardot and Josserand, 2011). 
The following two sample allocations are used: 

• the proportional allocation (PROP): = nN^/N for all h — 1, . . . H. 

• the -u^-optimal allocation (m 1 -OPTIM) as suggested by Cardot and Josserand 
(2011) and computed here with respect to the variance S 2 W ,^ rr of the linearized 



variable computed during the first week and denoted by u 



Jo ^«a: 



(t),u h 



dt 



n 



(i) 

k ) 



H. 



l \t),u h 



dt 



The M^-optimal allocation is similar to the Neyman optimal allocation but com- 
puted using vP~> instead of u. The x-optimal allocation is obtained when the 
consumption during the first week is used. 

Stratification based on the linearized variable during the first week 
The proposed strategy can be split into two steps: 

Step 1: we calculate the linearized variables vf£ for all k G U during the first week. 

Step 2: we stratify the population U using the k- means clustering algorithm with the 
euclidean distance and applied to the linearized variables for k e U. According to 
within cluster variance considerations, we decide to keep H = 4 different clusters. The 
strata sizes as well as both the proportional and w^-optimal allocation are given in 
TablaH 



Stratum number 


1 


2 


3 


4 


Stratum size N h 


6767 


2420 


2503 


7212 


PROP allocation 


716 


256 


265 


763 


uW-OPTIM allocation 


525 


395 


428 


652 



Table 1: Strata sizes, proportional and u^-optimal allocations when n = 2000. 

We plot in Figure [2] (a), the mean of u k computed during the second week and within 
the H = 4 strata. Differences among the strata means are noticeable accounting for 
a significant gain in efficiency if the proportional allocation is used. Now, to better 
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see what kind of consumers the four strata are built of, we plot in Figure [2] (b) the 
mean of the consumption Y^. We remark that the stratification based on induces 
a stratification for the consumption curves also. 



Mean of linearized variables within strata 



Mean of consumption within strata 




— I 1 1 1 1 — 

100 150 200 250 300 




100 150 200 250 300 
Hours 



(a) 



(b) 



Figure 2: Stratification based on the linearized variable: (a) Mean of linearized variables u\~ 
within each stratum, (b) Mean of the consumption curve within each stratum 



Stratification based on the consumption curve during the first week 

Cardot and Josserand (2011) suggested taking H = 4 strata corresponding to the 
maximum level of consumption during the first week and based on quartiles so that 
all strata have the same size. We denote the allocation obtained in this way by x- 
OPTIM. The strata sizes as well as both the proportional and the x-optimal allocation 
are given in Table |2j 

We plot in Figure [3] (b), the consumption mean within strata and during the second 
week. We notice that the stratum 4 corresponds to consumers with high global levels 
of consumption, whereas stratum 1, corresponds to consumers with low global of con- 
sumption. Figure [3] (a) gives the mean curves of the linearized variable within strata 
and computed for the second week. As for the first stratification, the population of the 
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Stratum number 


1 


2 


3 


4 


Stratum size N h 


4725 


4726 


4725 


4726 


PROP allocation 


500 


500 


500 


500 


a;-OPTIM allocation 


126 


212 


333 


1329 



Table 2: Strata sizes, proportional and x-optimal allocations for n = 2000. 



linearized variable curves is also stratified. 



Mean of linearized variables within strata Mean of consumption within strata 




Hours Hours 



(a) (b) 

Figure 3: Stratification based on the consumption curve: (a) Mean of linearized variables Uk 
within each stratum, (b) Mean of the consumption curve within each stratum 



4. Proportional-to-size sampling (PPS) 



Unequal probability designs are widely used in practice because they are usually more 
efficient than the equal probability designs. In PPS sampling, the sampling is with- 
replacement and the probability pk with which the individual k is selected is propor- 
tional to a positive measure Xk, where Xk is an auxiliary variable roughly proportional 
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to the study variable Y k . The probability of selection has the expression 

X k 

Pk = ^ 7^- 

In our situation, the study variable is a curve and so is the auxiliary information. To 
cope with this problem, we suggest using p k proportional to the mean of X k (t) over all 
t — \,...,D where D is the number of discretization points in the interval [0,T]. This 
means that 

X k 



Pk 



Ylkeu Xk 



where X k = Y^t=i X k (t)/D. For our study, we consider again X k as being the electricity 
consumption for the fcth firm recorded during the first week. The inclusion probabilities 
are given by n k = 1 — (1 — p k ) n . The Horvitz-Thompson estimator of the median is 
obtained by solving the equation 

V !l~^ M =0, (21) 
j-i n k \\Y k - m n \\ 

k£s 

where s is the set of distinct elements of s. In with-replacement designs, one may 
use the Hansen and Hurwitz (1943) estimator which presents the advantage that the 
variance formula is easier (no double sums are needed). 

5. Poststratification (POST) 



Let consider now the poststratification which is one of the simplest way to take into 
account auxiliary information in order to improve the Horvitz-Thompson estimator 
of the median. We suppose that the population is partitioned into subpopulations 
Ui, . . . , Ug according to a given classification principle. These subpopulations are called 
poststrata since they do not serve for performing stratified sampling as described before. 
Practical considerations may favor some other (perhaps simpler or less costly) designs, 
such as SRSWOR from the whole population U. After the sample selection, Y k is 
observed for the elements k e s and the sampling frame is used to establish the 
group each individual belongs to. Remark that group memberships may be unknown 
before the sample selection which makes impossible to perform the stratified sampling. 
Nevertheless, the group membership totals N g are known for all g — 1, . . . , G and this 
auxiliary information may be used to construct an improved estimator of ttin- The 
weights used in this case are given by 

w ks = N g / (Ngir k ) for all k G s g = s n U g 
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where N g = Y^kes l/ 71 "*- Hence, the poststratified estimator of m^v is obtained by 
solving the following equation 



Q 

v- v- N g Y k - m r 



(22) 



It is important to notice that the weights used here depend on the sample and are 
more general than the Horvitz-Thompson weights used before in the sense that they 
include the auxiliary information given by the group size N g . In this case, the auxiliary 
information is used at the estimation stage and not at the design stage. 
With SRSWOR sampling from U, the poststratified weights become Wk = N g /n g for 
all k e s g and n g the size of s g . The median estimator verifies 



G 

x x N a Y k — fh n 

}}—twt ^V=0. 23 

Z-^t Z-^i n \\V, — m I 
9=1 kes g 9 



n g \ \Y k - m..,, 

One should remark that in the case of the poststratification, the samples sizes n g are 



random. The poststratified estimator of the median given by (23) cannot be computed 
if some sample sizes n g are equal to zero. However, if the total sample size n is large 
enough, and if no group accounts for a very small proportion of the whole population, 
then the probability of having n g = is very small. Sarndal et al. (1992) suggest 
aggregating small groups in order to guarantee that n g are at least 20. 
The poststratified estimator is a calibrated estimator (Deville and Sarndal, 1992) when 
the auxiliary information is the group memberships with known totals. Using the 



same arguments as in Deville (1999), the expansion given in (14) remains valid and 
as a consequence, the asymptotic variance of m n is equal to the variance of residuals 
Ek = Uk — u g , u g = J2keu U k/N g and with SRSWOR, we obtain 



g=l 



where v is the group variance. We can remark that the variance agrees very nearly 
with the asymptotic variance for the stratified sampling with proportional allocation. 

We can conclude that the SRSWOR with poststratification is essentially as efficient as 
STRAT sampling with proportional allocation unless the sample is very small. This re- 
sult is well-known for multivariate variables Yk (see e.g., Sarndal et al. 1992). We have 
developed above a strategy based on the linearized variables to obtain well-constructed 
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strata. One drawback with that method is that the so-constructed strata reduce the 
variance for the median estimator but could be inefficient for many other variables. 
Therefore, using SRSWOR with poststratification will often improve overall efficiency. 



Consistency of m n (t) from survey data 

Focus is now on the estimation of the median curve of the consumption recorded during the 
second week. We consider for that the following sampling designs of size n = 2000 with the 
Horvitz-Thompson estimator: SRSWOR, STRAT based on the two stratification variables 
and with proportional and optimal allocation, SYS, PPS. We consider for our study the 
POST estimator also. Figure Q shows the estimation of the median curve computed from 
one sample selected according to four sampling designs. 

In order to compare these designs, we made 500 replications and considered the following 
loss criteria, 



R{m n ) = I \m n (t) - m N (t)\dt. 
Jo 

Since in our study we have equally spaced discretized time measurements, the above loss 
criterion is approximated due to quadrature rules by J2d=i \™n{td) — m 7v(^)l- Basic statis- 
tics, respectively boxplots, for the estimation errors of the median function estimator are 
given in Table |3j respectively Figure [5j The stratification variable used here is the one based 
on the linearized variables u k . First, we can observe that clustering the space of functions 
by performing stratified sampling leads to an important gain in terms of accuracy of the esti- 
mators, dividing by at least two times the mean error compared to simple random sampling 
without replacement. We note that the poststratification gives results similar to those given 
by stratified sampling with proportional allocation and that the SYS design with ordering 
on the first week mean consumption is almost as good as the SRSWOR design. A rather 
surprising result is obtained with PPS sampling. Simulations results not reported here show 
that this design performs very well for estimating the mean consumption curve, being as 
good as the stratified sampling but fails for the median curve. We believe that this fact is 
due to numerical problems encountered in the resolution of the implicit equation (21) when 
a large number of small probabilities of selection pk are used to estimate mjy. More research 
is needed to better clarify this issue and to find a way to improve it. 
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Hours 



Figure 4: One sample estimation of the median trajectory for n = 2000 with 4 sampling 
strategies (SRSWOR, SYS, STRAT.OPT, STRAT.PROP). 

We also performed simulations for the second stratification based on the first week consump- 
tion Xk and the results are given in Table |4} 

We can remark that STRAT with proportional allocation and stratification based on vP^ 
gives better results than STRAT with x-optimal allocation stratification based on X. This 
result is not surprising since in the latter case, the strata have been constructed taken into 
account the consumption variable and the optimal allocation has been computed by mini- 
mizing the variance for the mean estimator while we are interested here in estimating the 
median curve. This is why, the proportional allocation is usually advisable with multipurpose 
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Figure 5: Comparison of the distribution of estimation errors of the median curve for SR- 
SWOR, PROP, OPTIM, SYS. 
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Mean 


1 st quart ile 


median 


3 rd quartile 


SRSWOR 


2.531 


1.322 


1.982 


3.351 


SYS 


2.625 


1.355 


2.412 


3.087 


PROP 


1.060 


0.8498 


1.017 


1.234 


uW-OPTIM 


1.000 


0.7946 


0.9552 


1.142 


POST 


1.041 


0.8275 


0.9785 


1.203 


PPS 


7.1410 


2.7880 


6.1370 


9.5600 


Table 3: Estimation errors for m^. 




Mean 


I s * quart ile 


median 


3 rd quartile 


PROP 


1.7370 


1.0470 


1.4860 


2.2480 


x-OPTIM 


2.2940 


1.4660 


1.9790 


2.7830 



Table 4: Estimation errors for with stratification based on Y^. 

surveys. Moreover, if we compare the two stratifications, we remark that the stratification 
based on the consumption variable is less efficient than the stratification based on the lin- 
earized variable but it remains still better than the SRSWOR or SYS designs. 

Both stratifications used in this paper need the consumption curve computed during 
the first week for all the individuals from the population. Sometimes, this can be too 
costly to obtain or even impossible because of storage or confidentiality constraints. In such 
situations, some other stratification variables may be considered such as for example, the 
electricity power given by the subscribed contract between one firm and EDF. 

Consistency of the variance function estimation from survey data 

We analyze in the following the estimator for the variance function var(t) when the SRSWOR 
and STRAT designs are used. To judge the quality of the estimators, we use the following 
criterion 

f T 

R{var) = / | var (t) — var (t) \ dt. 
Jo 

We give in Table [5] statistics for the estimation errors of the variance function estimation with 
SRSWOR and stratified sampling with proportional and w^-optimal allocations. Figure M 
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Figure 6: Theoretical standard deviation function of rh n (t) for simple random sampling 
without replacement (solid line), stratified sampling with proportional allocation (dotted 
line) and stratified sampling with -u^-optimal allocation (dashed line). 

gives the theoretical standard deviation function curves of fh n , y/var(t), with the considered 
designs. 





Mean 


I s * quartile 


median 


3 rd quartile 


SRSWOR 


0.599 


0.339 


0.506 


0.750 


PROP 


0.068 


0.055 


0.064 


0.076 


uW-OPTIM 


0.056 


0.047 


0.053 


0.062 



Table 5: Statistics about the estimation errors for var(t). 

One can remark that the theoretical variance is much smaller, at all instants t, for the 
stratified sampling with optimal allocation rule. The stratified sampling with optimal allo- 
cation gives more accurate estimation of var(t) than the other strategies. We can observe 
that clustering the space of functions by performing stratified sampling may leads to a con- 
siderable gain in terms of accuracy of the estimators of the variance function, dividing by 
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ten the mean error compared to simple random sampling without replacement. Moreover, 
there is also a difference between proportional and optimal allocations rules, for example the 
third quartile in optimal case is lower than the median loss in the proportional case. 

4 Conclusion and perspectives 

In this paper, we have developed a survey sampling approach for estimating the median 
of a functional variable. From a practical point of view, an appealing consequence of the 
new methodology is that the proposed estimators are faster to calculate. The experimental 
results on a test population of electricity consumption curves confirm that even with high 
dimensional data, stratification associated with the optimal allocation rule leads to important 
reduction of the variance estimators. Having appropriate strata is the key for getting more 
accurate estimators and the k-means algorithm is well adapted in this situation. Nevertheless, 
choosing the stratification variables is a rather complex issue and more work is needed in 
this direction. 

A challenging future research avenue concerns the use of auxiliary information at the 
estimation stage. While, in this paper, we have concentrated on the estimation of the 
median using the Horvitz-Thompson estimator or the poststratified estimator, more complex 
estimators using functional regression models can be developed. For example, it is possible 
to set a linear functional model which explains the functional variable using a scalar Xk 
and to develop a regression estimator for the median curve. Developing a general framework 
for regression estimators for the median curve is left for future studies. 

Acknowledgments 

The authors thank the two anonymous referees, and the associate editor for their constructive 
remarks that helped to improve the manuscript. 

Bibliography 

Brown, B.M. (1983) Statistical Use of the Spatial Median, Journal of the Royal Statistical 
Society, B, 45, 25-30. 

Cadre, B. (2001). Convergent estimators for the L 1 -median of a Banach valued random 
variable. Statistics, 35 (4), 509-521. 

Cardot, H., Cenac, P. and Zitt, P.-A. (2011). Efficient and fast estimation of the geometric 



26 



median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 
to appear. 

Cardot, H., Chaouch, M., Goga, C. and Labruere, C. (2010), Properties of design-based 
functional principal components analysis, Journal of Statistical Planning and Inference, 
140, 75-91. 

Cardot, H. and Josserand, E. (2011), Horvitz-Thompson estimators for functional data: 
asymptotic confidence bands and optimal allocation for stratified sampling, Biometrika, 
98, 107-118. 

Chaouch, M. and Goga, C. (2010), Design-based estimation for geometric quantile with 
application to outlier detection, Computational Statistics and Data Analysis, 54, 2214- 
2229. 

Chaudhuri, P. (1996) On a Geometric Notion of Quantiles for Multivariate Data, Journal 
of the American Statistical Association, 91, pp. 862-872. 

Chiky, R. and Hebrail, G. (2008). Summarizing distributed data streams for storage in 
data warehouses, in DaWaK 2008, I-Y. Song, J. Eder and T. M. Nguyen, eds. Lecture 
Notes in Computer Science, Springer, 65-74. 

Deville, J.C. (1999). Variance estimation for complex statistics and estimators: lineariza- 
tion and residual techniques. Survey Methodology, 25, 193-203. 

Deville, J. C. and Sarndal, C. E. (1992). Calibration estimators in survey sampling, Journal 
of the American Statistical Association, 87, 376-382. 

Fuller, W.A. (2009). Sampling Statistics. John Wiley and Sons. 

Gervini, D. (2008). Robust functional estimation using the spatial median and spherical 
principal components. Biometrika, 95, 587-600. 

Gini, K. and Galvani, L. (1929). Di talune estensioni dei concetti di media ai caratteri 
qualitativi. Metron, 8, 3-209. 

Goga, C. and Ruiz-Gazen, A. (2011) Efficient Estimation of Nonlinear Finite Population 
Parameters Using Nonparametrics, submitted. 

Gower, J.C. (1974). Algorithm as 78: The mediancentre. Journal of the Royal Statistical 
Society, Series C, Applied Statistics, 23, 466-470. 



27 



Haldane, J.B.S. (1948). Note on the median of a multivariate distribution. Biometrika, 
35, 414-417. 

Hayford, J. F. (1902). What is the center of an area, or the center of a population ? 
Journal of the American Statistical Association, 8, 47-58. 

Hampel, F.R. (1974). The influence curve and its role in robust statistics. Journal of the 
American Statistical Association, 69, 383-393. 

Hansen, M. H. and Hurwitz, W.N. (1943). On the theory of sampling from finite population 
Annals of Mathematical Statistics, 14, 333-362. 

Horvitz, D.G. and Thompson, D.J. (1952), A generalization of sampling without replace- 
ment from a finite universe, Journal of the American Statistical Association, 47, 663- 
685. 

Kemperman, J.H.B. (1987), The median of a finite measure on a Banach space, In: Dodge, 
Y. (Ed.), Statistical Data Analysis Based on the L\ Norm and Related Methods, North- 
Holland, Amesterdam, 217-230. 

Korn, E.L. and Graubard, B.I. (1999). Analysis of Health Surveys, Wiley, New York. 

Koenker, R., and Bassett, G. (1978) Regression Quantiles, Econometrica, 46, 33-50. 

Lehtonen, R. and Pahkinen, E. (2004). Practical Methods for Design and Analysis of 
Complex Surveys, Wiley, New York. 

Lohr, S. L. (1999). Sampling: Design and Analysis, Duxbury Press. 

Ramsay, J.O. and Silverman, B.W. (2005), Functional Data Analysis, 2nd edition, Springer, 
Berlin. 

Sarndal, C.E., Swensson, B. and Wretman, J. (1992). Model Assisted Survey Sampling. 
Springer- Verlag. 

Serfling, R. (1980). Approximation Theorems of Mathematical Statistics, Wiley, New York. 

Serfling, R. (2002) Quantile functions for multivariate analysis: approaches and applica- 
tions. Statistica Neerlandica, 56, 214-232. 

Small, C.G. (1990). A survey of multidimensional medians, International Statistical Review, 
58, 263-277. 



28 



Vardi, Y and Zhang, C.H. (2000). The multivariate Li-median and associated data depth. 
Proc. Natl. Acad. Sci. USA, 97, 1423-1426. 

von-Mises, R. (1947). On the asymptotic distribution of differentiable statistical functions. 
Annals of Mathematical Statistics, 18, 309-348. 

Weber, A. (1909), Uber Den Standard Der Industrien, Tubingen. English translation by 
C. J. Freidrich (1929), em Alfred Weber's Theory of Location of Industries, Chicago: 
Chicago University Press. 



29 



